Capstone Project - Covid in South Korea¶


DISCLAIMER: When viewing this notebook on GitHub¶

Open the notebook in JupyterLab for the optimal experience.

Some charts/diagrams/features are not visible on GitHub.

This impacts all:

  • plotly plots
  • folium maps
  • embedded images
  • and all other dynamic content

This is standard and well-known behaviour.

While some workarounds are possible, none of them fixes all of the issues.

  • Plotly's rendering workarounds require installing external tools, and they don't fix the other issues
  • Standard workarounds (nbviewer) do not work because the GitHub repo is private.

Workaround¶

Open the notebook in JupyterLab for the optimal experience.

Goals¶

  1. You [...] have to create and justify a plan for fighting the pandemic in your country by analyzing the provided data.
  2. You must extract the most critical insights using the data science techniques you have learned.

Loading Data¶

In [1]:
display()
In [2]:
%reload_ext autoreload
%autoreload 1

import matplotlib.pyplot as plt
import matplotlib.patches as patches
import plotly.graph_objects as go
import plotly.express as px
import numpy as np
import pandas as pd
import folium
from folium import plugins

import seaborn as sns
from functools import lru_cache

from itertools import product

from random import random, seed

from utils import edu_utils as u
from utils import auto_kaggle
from utils import chart_utils as charts
from utils import folium_utils
from utils import pandas_auto_types as pat


%aimport utils.edu_utils
%aimport utils.auto_kaggle
%aimport utils.chart_utils
%aimport utils.folium_utils
In [3]:
seed(100)

pd.options.display.max_rows = 100

u.check("done")
yes! ✅

Auto-Fetching Dataset from Kaggle¶

In [4]:
# covid
dataset = "kimjihoo/coronavirusdataset"
csv = "SeoulFloating.csv"
auto_kaggle.download_dataset(dataset, csv)
Kaggle API 1.5.12 - login as 'edualmas'
Dataset kimjihoo/coronavirusdataset	 Skipped

Auto-Detecting Data Types¶

I created a small utility library (pandas_auto_types, imported as pat) with custom functions for common dataset tasks:

  • loading from CSVs
  • auto-casting (ints, floats, ...)
  • auto-parsing (dates, timestamps, ...)
  • auto-detection of categorical fields
  • auto-detection of id fields
  • while allowing common customizations:
    • values to treat as NAs
    • skipping conversions for some types
    • thresholds for considering a field "categorical"
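The pandas_auto_types internals are not shown in this notebook; the following is a minimal sketch of what such an auto-typing loader could look like (read_csv_auto is a hypothetical stand-in, not the actual implementation):

```python
import pandas as pd


def read_csv_auto(path, read_csv_kws=None, category_threshold_percent=0.5):
    """Hypothetical sketch of an auto-typing CSV loader."""
    df = pd.read_csv(path, **(read_csv_kws or {}))
    df = df.convert_dtypes()  # auto-cast to nullable Int64/Float64/string/boolean
    for col in df.columns:
        if "date" in col.lower():
            # auto-parse date-like columns
            df[col] = pd.to_datetime(df[col], errors="coerce")
        elif df[col].dtype == "string":
            # treat low-cardinality string columns as categorical
            if df[col].nunique() / max(len(df), 1) <= category_threshold_percent:
                df[col] = df[col].astype("category")
    return df
```

The real helpers also detect id fields and take NA-value customizations; the sketch only shows the casting/parsing/categorizing core.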
In [5]:
read_csv_kws = {"na_values": [" ", "-"]}

# Cases and Patients
raw_case = pat.read_csv(
    "dataset/Case.csv", read_csv_kws, category_threshold_percent=0.3
)
raw_patientInfo = pat.read_csv(
    "dataset/PatientInfo.csv",
    read_csv_kws,
    convert_dtypes_kws={"convert_integer": False},
).set_index("patient_id")

# Timelines and trends
raw_weather = pat.read_csv( 
    "dataset/Weather.csv", read_csv_kws, category_threshold_percent=0.01
)
raw_timeAge = pat.read_csv("dataset/TimeAge.csv", read_csv_kws)
raw_time = pat.read_csv("dataset/Time.csv", read_csv_kws)
raw_timeGender = pat.read_csv("dataset/TimeGender.csv", read_csv_kws)
raw_timeProvince = pat.read_csv("dataset/TimeProvince.csv", read_csv_kws)
raw_seoulFloating = pat.read_csv("dataset/SeoulFloating.csv", read_csv_kws)

# online searching trends
raw_searchTrend = pat.read_csv("dataset/SearchTrend.csv", read_csv_kws)

# Gov. Policies
raw_policy = pat.read_csv("dataset/Policy.csv", read_csv_kws)

# Country statistics
raw_region = pat.read_csv("dataset/Region.csv", read_csv_kws)

### Let's inspect all the datasets we have in memory:
all_raw_dfs = u.all_vars_with_prefix("raw_", locals())
all_raw_dfs.keys()
Out[5]:
dict_keys(['raw_case', 'raw_patientInfo', 'raw_weather', 'raw_timeAge', 'raw_time', 'raw_timeGender', 'raw_timeProvince', 'raw_seoulFloating', 'raw_searchTrend', 'raw_policy', 'raw_region'])

Manual Data Cleaning¶

  • [x] Explore Data
    • [x] Validate auto-detected data types
    • [x] Manually correct data types (if necessary)
    • [x] Handling NAs
    • [x] Drop times from daily snapshots

Review auto-detected dtypes¶

Let's check the dtypes of our dataframes.

We're particularly interested in the data types that our custom functions determined for each of the columns.

Let's do a quick inspection of all the raw_* dataframes.

In [6]:
for name, df in all_raw_dfs.items():
    print("#" * 5, name, "#" * (50 - len(name)))
    print(df.dtypes, "\n")
##### raw_case ##########################################
case_id             object
province          category
city              category
group              boolean
infection_case      string
confirmed            Int64
latitude           Float64
longitude          Float64
dtype: object 

##### raw_patientInfo ###################################
sex                         category
age                         category
country                     category
province                    category
city                        category
infection_case              category
infected_by                   string
contact_number               Float64
symptom_onset_date    datetime64[ns]
confirmed_date        datetime64[ns]
released_date         datetime64[ns]
deceased_date         datetime64[ns]
state                       category
dtype: object 

##### raw_weather #######################################
code                              Int64
province                       category
date                     datetime64[ns]
avg_temp                        Float64
min_temp                        Float64
max_temp                        Float64
precipitation                   Float64
max_wind_speed                  Float64
most_wind_direction               Int64
avg_relative_humidity           Float64
dtype: object 

##### raw_timeAge #######################################
date         datetime64[ns]
time                  Int64
age                category
confirmed             Int64
deceased              Int64
dtype: object 

##### raw_time ##########################################
date         datetime64[ns]
time                  Int64
test                  Int64
negative              Int64
confirmed             Int64
released              Int64
deceased              Int64
dtype: object 

##### raw_timeGender ####################################
date         datetime64[ns]
time                  Int64
sex                category
confirmed             Int64
deceased              Int64
dtype: object 

##### raw_timeProvince ##################################
date         datetime64[ns]
time                  Int64
province           category
confirmed             Int64
released              Int64
deceased              Int64
dtype: object 

##### raw_seoulFloating #################################
date          datetime64[ns]
hour                   Int64
birth_year             Int64
sex                 category
province            category
city                category
fp_num                 Int64
dtype: object 

##### raw_searchTrend ###################################
date           datetime64[ns]
cold                  Float64
flu                   Float64
pneumonia             Float64
coronavirus           Float64
dtype: object 

##### raw_policy ########################################
policy_id             object
country             category
type                  string
gov_policy            string
detail                string
start_date    datetime64[ns]
end_date      datetime64[ns]
dtype: object 

##### raw_region ########################################
code                           Int64
province                    category
city                          string
latitude                     Float64
longitude                    Float64
elementary_school_count        Int64
kindergarten_count             Int64
university_count               Int64
academy_ratio                Float64
elderly_population_ratio     Float64
elderly_alone_ratio          Float64
nursing_home_count             Int64
dtype: object 

Manual Data Cleaning¶

So far, we have used custom utility functions that automatically identify dtypes.

We still see, however, some minor issues (unexpected data types), which we want to fix manually:

  • [x] raw_patientInfo
    • [x] Several fields with missing data (% NAs)
  • [x] raw_timeAge
    • [x] time field is all '0'; can be dropped.
  • [x] raw_time
    • [x] time is inconsistent (0s or 16s, but only 1 entry per day); can be dropped.
  • [x] raw_timeGender
    • [x] time field is all '0'; can be dropped.
  • [x] raw_timeProvince
    • [x] time field is all '0'; can be dropped.
  • [x] raw_seoulFloating
    • [x] birth_year is Int64 but should be an ordered category (age bucket)

As we fix each raw_ dataframe, we will create a copy to use for EDA (with the df_ prefix), and we'll drop the raw_ one to save memory.

Dataset patientInfo¶

  • [x] NAs: Lots of missing values
    • [x] drop fields with more than 75% NAs
      • [x] contact_number
      • [x] symptom_onset_date
    • [x] add "unknown" value to categorical fields infection_case, sex and age
    • [x] drop rows that still have NAs (e.g. city: 1% of rows)
  • [x] let's validate that when state is "deceased", all rows have a deceased_date
  • [x] let's validate that when state is "released", all rows have a released_date

Dropping fields with more than 75% NAs¶

Let's calculate the % of missing data for each column.
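pat.calculate_na_percent is one of the custom utilities; a plausible equivalent is a pandas one-liner (a sketch, not the actual implementation):

```python
import pandas as pd


def calculate_na_percent(df: pd.DataFrame) -> pd.Series:
    """Hypothetical sketch: % of missing values per column, as whole percents."""
    return (df.isna().mean() * 100).round().astype(int)


demo = pd.DataFrame({"a": [1, None, None, 4], "b": [1, 2, 3, 4]})
print(calculate_na_percent(demo)["a"])  # 50
```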

In [7]:
pat.calculate_na_percent(raw_patientInfo)
Out[7]:
sex                   21
age                   26
country                0
province               0
city                   1
infection_case        17
infected_by           73
contact_number        84
symptom_onset_date    86
confirmed_date         0
released_date         69
deceased_date         98
state                  0
dtype: int64

infected_by is 73% NAs, but that's expected: this value is only relevant for the subset of patients whose infection_case is "contact with patient". We will keep it, as it enables contact tracing.
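As a sketch of the contact tracing this column enables (a toy frame using two patient IDs that appear in the preview further below):

```python
import pandas as pd

# toy version of the patientInfo frame: who infected whom
patients = pd.DataFrame(
    {
        "patient_id": ["1000000002", "1000000005"],
        "infected_by": [None, "1000000002"],
    }
).set_index("patient_id")

# all patients (directly) infected by patient 1000000002
contacts = patients[patients["infected_by"] == "1000000002"]
print(contacts.index.tolist())  # ['1000000005']
```

Repeating this lookup recursively would recover whole infection chains.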

In [8]:
raw_patientInfo.drop(["contact_number", "symptom_onset_date"], axis=1, inplace=True)
raw_patientInfo.head()
Out[8]:
sex age country province city infection_case infected_by confirmed_date released_date deceased_date state
patient_id
1000000001 male 50s Korea Seoul Gangseo-gu overseas inflow <NA> 2020-01-23 2020-02-05 NaT released
1000000002 male 30s Korea Seoul Jungnang-gu overseas inflow <NA> 2020-01-30 2020-03-02 NaT released
1000000003 male 50s Korea Seoul Jongno-gu contact with patient 2002000001 2020-01-30 2020-02-19 NaT released
1000000004 male 20s Korea Seoul Mapo-gu overseas inflow <NA> 2020-01-30 2020-02-15 NaT released
1000000005 female 20s Korea Seoul Seongbuk-gu contact with patient 1000000002 2020-01-31 2020-02-24 NaT released

Adding unknown category to infection_case, sex and age¶

In [9]:
raw_patientInfo["sex"] = pat.coalesce_categorical(raw_patientInfo["sex"])
raw_patientInfo["age"] = pat.coalesce_categorical(raw_patientInfo["age"])
raw_patientInfo["infection_case"] = pat.coalesce_categorical(
    raw_patientInfo["infection_case"]
)
raw_patientInfo["city"] = pat.coalesce_categorical(raw_patientInfo["city"])

Validating Deceased entries¶

Let's validate that all rows marked as deceased have a deceased_date.

In [10]:
has_deceased_date = raw_patientInfo["deceased_date"].notna()
is_marked_as_deceased = raw_patientInfo["state"] == "deceased"
deceased_matches = is_marked_as_deceased == has_deceased_date

u.check(len(deceased_matches[deceased_matches == False]) == 0)
no! ❌

It seems some rows are marked as deceased but are missing the deceased_date.

In [11]:
raw_patientInfo[deceased_matches == False].loc[
    :, ["confirmed_date", "deceased_date", "state"]
]
Out[11]:
confirmed_date deceased_date state
patient_id
1000000013 2020-02-16 NaT deceased
1000000109 2020-03-07 NaT deceased
1000000285 2020-03-19 NaT deceased
1000000473 2020-03-31 NaT deceased
1000000997 2020-06-08 NaT deceased
1000001062 2020-06-11 NaT deceased
1000001118 2020-06-14 NaT deceased
1100000071 2020-02-28 NaT deceased
1100000095 2020-03-13 NaT deceased
1100000097 2020-03-13 NaT deceased
6002000002 2020-02-22 NaT deceased
6022000049 2020-03-15 NaT deceased

We don't have a deceased_date for everyone. We'll make a note of it and move on.

Validating Released entries¶

We can do the same analysis for released patients, with one added complexity: patients can have a released_date while no longer being in state "released" (they might have transitioned to another status after being released).

This was not possible for the previous case (deceased), which kept that logic simple.

The check on released_date therefore requires a different boolean operation. Pandas does not implement the IMP operator (material implication), but we have created an implementation in our utilities.

Note that the IMP operator is not commutative, so the order of the conditions matters.

This will let us find any cases where patients are marked as released but don't have a release date (which would be a problem). We don't care about the other combinations (has a release date but is no longer marked as "released", etc.).
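In plain pandas boolean algebra, the implication check boils down to the following sketch (hypothetical; u.assert_imp presumably wraps something similar):

```python
import pandas as pd

state = pd.Series(["released", "released", "isolated"])
has_date = pd.Series([True, False, False])  # stands in for released_date.notna()

p = state == "released"  # premise: marked as released
q = has_date             # conclusion: has a release date

# p IMP q is equivalent to ~p | q; rows violating it are exactly p & ~q.
# Note the asymmetry: q & ~p (date but not marked released) is NOT a violation.
violations = p & ~q
print(violations.tolist())  # [False, True, False]
```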

In [12]:
marked_as_released = raw_patientInfo["state"] == "released"
should_have_release_date = raw_patientInfo["released_date"].notna()
invalid_rows = u.assert_imp(
    raw_patientInfo, marked_as_released, should_have_release_date
)

u.check(0 == len(invalid_rows))
print(f"{len(invalid_rows)} released patients without a release date")
invalid_rows.head()
no! ❌
1350 released patients without a release date
Out[12]:
sex age country province city infection_case infected_by confirmed_date released_date deceased_date state
patient_id
1000000015 male 70s Korea Seoul Seongdong-gu Seongdong-gu APT <NA> 2020-02-19 NaT NaT released
1000000018 male 20s Korea Seoul etc etc <NA> 2020-02-20 NaT NaT released
1000000020 female 70s Korea Seoul Seongdong-gu Seongdong-gu APT 1000000015 2020-02-20 NaT NaT released
1000000022 male 30s Korea Seoul Seodaemun-gu Eunpyeong St. Mary's Hospital <NA> 2020-02-21 NaT NaT released
1000000023 male 50s Korea Seoul Seocho-gu Shincheonji Church <NA> 2020-02-21 NaT NaT released

It seems that the dataset is missing some data:

  • not all "deceased" people have a deceased date
  • not all "released" people have a released date

This might extend to other gaps in the data. We will not run any further checks, but it does give us a sense of how complete/cohesive the data is.

In [13]:
df_patientInfo = raw_patientInfo.copy()
del raw_patientInfo

Datasets for Daily Snapshots¶

Dropping Time columns for daily aggregates¶

We have previously identified 4 dataframes containing daily aggregates with additional breakdowns (by region, gender, age group), where the time column is not meaningful (since there is only 1 row per day and breakdown).

  1. raw_timeAge
  2. raw_time
  3. raw_timeGender
  4. raw_timeProvince

A quick check shows which time values each dataframe contains:

In [14]:
print(pd.unique(raw_timeAge["time"]))
print(pd.unique(raw_timeGender["time"]))
print(pd.unique(raw_time["time"]))
print(pd.unique(raw_timeProvince["time"]))
<IntegerArray>
[0]
Length: 1, dtype: Int64
<IntegerArray>
[0]
Length: 1, dtype: Int64
<IntegerArray>
[16, 0]
Length: 2, dtype: Int64
<IntegerArray>
[16, 0]
Length: 2, dtype: Int64

The first two we can drop without a second thought: they only contain 0.

In [15]:
df_timeAge = raw_timeAge.drop("time", axis=1)
del raw_timeAge
df_timeAge.columns
Out[15]:
Index(['date', 'age', 'confirmed', 'deceased'], dtype='object')
In [16]:
df_timeGender = raw_timeGender.drop("time", axis=1)
del raw_timeGender
df_timeGender.columns
Out[16]:
Index(['date', 'sex', 'confirmed', 'deceased'], dtype='object')

The other two require one extra check to make sure that we're not accidentally deleting valuable data:

  • raw_time should have 1 row per day
  • raw_timeProvince should have 17 rows per day (1 per day × 17 provinces)
In [17]:
rows_per_day = raw_time["date"].value_counts()
u.check(0 == len(rows_per_day[rows_per_day != 1]))
yes! ✅
In [18]:
rows_per_day = raw_timeProvince["date"].value_counts()
u.check(0 == len(rows_per_day[rows_per_day != 17]))
yes! ✅

With this we confirm that no day has duplicate entries, so we can safely drop those time columns as well.

In [19]:
df_time = raw_time.drop("time", axis=1)
del raw_time
df_time.columns
Out[19]:
Index(['date', 'test', 'negative', 'confirmed', 'released', 'deceased'], dtype='object')
In [20]:
df_timeProvince = raw_timeProvince.drop("time", axis=1)
del raw_timeProvince
df_timeProvince.columns
Out[20]:
Index(['date', 'province', 'confirmed', 'released', 'deceased'], dtype='object')

Other Datasets¶

Dataset seoulFloating¶

We want to convert birth_year from a numerical series to a categorical one with a natural ordering.
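pat.categorise_column_ordered is a custom helper; the underlying pandas primitive is an ordered CategoricalDtype, along these lines (a sketch of what it presumably does):

```python
import pandas as pd

age = pd.Series(["20s", "40s", "20s", "70s"])
age = age.astype(
    pd.CategoricalDtype(["20s", "30s", "40s", "50s", "60s", "70s"], ordered=True)
)
# ordered categories support min/max and comparisons in bucket order
print(age.min(), age.max())  # 20s 70s
```

The ordering also makes seaborn/plotly render the buckets in the natural order rather than alphabetically.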

In [21]:
df_seoulFloating = raw_seoulFloating.copy()
del raw_seoulFloating
df_seoulFloating["age"] = df_seoulFloating["birth_year"].astype("str") + "s"
df_seoulFloating = df_seoulFloating[
    ["date", "hour", "age", "sex", "province", "city", "fp_num"]
]
df_seoulFloating = pat.categorise_column_ordered(
    df_seoulFloating, "age", ["20s", "30s", "40s", "50s", "60s", "70s"]
)
df_seoulFloating.head()
Out[21]:
date hour age sex province city fp_num
0 2020-01-01 0 20s female Seoul Dobong-gu 19140
1 2020-01-01 0 20s male Seoul Dobong-gu 19950
2 2020-01-01 0 20s female Seoul Dongdaemun-gu 25450
3 2020-01-01 0 20s male Seoul Dongdaemun-gu 27050
4 2020-01-01 0 20s female Seoul Dongjag-gu 28880
In [22]:
u.check(
    0 == df_seoulFloating["age"].isna().sum()
)  # our categorical type captured all values
yes! ✅
In [23]:
dataset_creation_date = df_seoulFloating["date"].max().strftime("%Y-%m-%d")
dataset_creation_date
Out[23]:
'2020-05-31'

Dataset Policy¶

raw_policy has empty cells, but they can be ignored. Most of them are in end_date, which likely means the policy was still in place when this dataset was created.

The other two are in detail, where a policy detail might simply not be applicable. There's no need to drop anything: the data makes sense with these NAs.

In [24]:
raw_policy.isna().sum()
Out[24]:
policy_id      0
country        0
type           0
gov_policy     0
detail         2
start_date     0
end_date      37
dtype: int64
In [25]:
policy_without_enddate = raw_policy.loc[:, raw_policy.columns != "end_date"]

policy_without_enddate[policy_without_enddate.isna().sum(axis=1) > 0]
Out[25]:
policy_id country type gov_policy detail start_date
50 51 Korea Technology Self-Diagnosis App <NA> 2020-02-12
51 52 Korea Technology Self-Quarantine Safety Protection App <NA> 2020-03-07
In [26]:
df_policy = raw_policy
del raw_policy

Dataset Weather¶

This dataset has a few NA values

We could drop all data prior to 2020, but we're probably better off keeping it all, with gaps, in case we need to find trends across multiple years. We can ignore/drop data during the analysis phase.

In [27]:
raw_weather.isna().sum()
Out[27]:
code                      0
province                  0
date                      0
avg_temp                 15
min_temp                  5
max_temp                  3
precipitation             0
max_wind_speed            9
most_wind_direction      29
avg_relative_humidity    20
dtype: int64
In [28]:
raw_weather["date"] = pd.to_datetime(raw_weather["date"], infer_datetime_format=True)
In [29]:
raw_weather[raw_weather["date"] >= "2020-01-01"].isna().sum()
Out[29]:
code                     0
province                 0
date                     0
avg_temp                 0
min_temp                 0
max_temp                 0
precipitation            0
max_wind_speed           0
most_wind_direction      1
avg_relative_humidity    0
dtype: int64
In [30]:
df_weather = raw_weather
del raw_weather

Creating the rest of the EDA-ready dataframes¶

As we fixed some of the raw_ dataframes, we created their EDA-ready counterparts.

The rest of the raw dataframes required no manual tweaking; let's create their EDA-ready copies and drop the remaining raw dfs.

In [31]:
u.all_vars_with_prefix("raw_", locals()).keys()
Out[31]:
dict_keys(['raw_case', 'raw_searchTrend', 'raw_region'])
In [32]:
df_case = raw_case.copy()
del raw_case

df_searchTrend = raw_searchTrend.copy()
del raw_searchTrend

df_region = raw_region.copy()
del raw_region
In [33]:
u.check(0 == len(u.all_vars_with_prefix("raw_", locals()).keys()))
u.check(11 == len(u.all_vars_with_prefix("df_", locals()).keys()))
yes! ✅
yes! ✅

The Datasets at a glance¶

In this initial section of actual exploration, we want to get a general sense of the data, find general trends, and develop some hypotheses that we can validate later.

We will start from the most innocuous and simple and move to the more complex, so we can progressively build up our contextual domain knowledge (around Korea as a country, covid as an epidemic, and their interaction: spread vs weather patterns vs policies, etc.).

Region Dataset¶

The first thing we want to do is get a general understanding of Korea. As an analyst with very little context, we want to get a general feel for:

  • [x] Which regions have the most at-risk citizens
  • [x] What are the most populated regions

Elderly Citizens¶

In [34]:
df_region.head()
Out[34]:
code province city latitude longitude elementary_school_count kindergarten_count university_count academy_ratio elderly_population_ratio elderly_alone_ratio nursing_home_count
0 10000 Seoul Seoul 37.566953 126.977977 607 830 48 1.44 15.38 5.8 22739
1 10010 Seoul Gangnam-gu 37.518421 127.047222 33 38 0 4.18 13.17 4.3 3088
2 10020 Seoul Gangdong-gu 37.530492 127.123837 27 32 0 1.54 14.55 5.4 1023
3 10030 Seoul Gangbuk-gu 37.639938 127.025508 14 21 0 0.67 19.49 8.5 628
4 10040 Seoul Gangseo-gu 37.551166 126.849506 36 56 1 1.17 14.39 5.7 1080
In [35]:
province_analysis = df_region.copy()

region_elderly_order = (
    province_analysis[["province", "elderly_population_ratio"]]
    .groupby("province")
    .mean()
    .sort_values("elderly_population_ratio", ascending=False)
)

region_elderly = province_analysis[["province", "city", "elderly_population_ratio"]]
In [36]:
nursing_home_by_province = (
    province_analysis[["province", "nursing_home_count"]]
    .groupby("province")
    .sum()
    .sort_values("nursing_home_count", ascending=False)
)
nursing_home_order = nursing_home_by_province.index.drop("Korea")
In [37]:
f, (ax_elderly, ax_nursing) = plt.subplots(1, 2, figsize=(12, 7))
plt.gcf().tight_layout(pad=10.0)

sns.barplot(
    data=region_elderly,
    x="elderly_population_ratio",
    y="province",
    order=region_elderly_order.index,
    ax=ax_elderly,
)

sns.barplot(
    data=nursing_home_by_province,
    x="nursing_home_count",
    y=nursing_home_by_province.index,
    order=region_elderly_order.index,
    ax=ax_nursing,
)

plt.show()

Based on this overview, we want to keep an eye on the following regions, due to their at-risk population:

High population in nursing homes (absolute numbers):

  • Seoul
  • Gyeonggi-do

High % of elderly population

  • Jeollanam-do
  • Gyeongsangbuk-do
  • Jeollabuk-do

By crossing information from both graphs, we can hypothesize that Seoul and Gyeonggi-do also have the highest population density in the country, because:

  • they have a lot of nursing homes (abs), and yet their elderly population ratio is on par with the rest of the country.

Let's verify this by crossing a couple of additional data points to get a better picture:

Population Density¶

In [38]:
province_analysis["infrastructure_buildings"] = (
    province_analysis["elementary_school_count"]
    + province_analysis["kindergarten_count"]
    + province_analysis["university_count"]
    + province_analysis["nursing_home_count"]
)

population_estimate = (
    province_analysis[["province", "infrastructure_buildings"]]
    .groupby("province")
    .sum()
)


sns.barplot(
    data=population_estimate,
    y=population_estimate.index,
    x=population_estimate["infrastructure_buildings"],
)
plt.title("Infrastructure Buildings by province")
Out[38]:
Text(0.5, 1.0, 'Infrastructure Buildings by province')

Even though we cannot calculate the population density for each province, we can get an approximate idea of which areas have more public buildings, which can serve as an indicator of the overall population distribution.

We can tentatively confirm our previous hypothesis for Seoul and Gyeonggi-do, since the trend also extends to other buildings, not just nursing homes.

TimeProvince Dataset¶

Continuing from our previous visualization, let's try to get an understanding of how each province has evolved during the first months of this pandemic

In [39]:
cases_by_province = df_timeProvince.groupby(["province", "date"]).sum(numeric_only=True)
cases_by_province = cases_by_province.reset_index().melt(
    id_vars=["province", "date"],
    value_vars=["confirmed", "released", "deceased"],
    var_name="metric",
    value_name="cases",
)
cases_by_province.head()
Out[39]:
province date metric cases
0 Busan 2020-01-20 confirmed 0
1 Busan 2020-01-21 confirmed 0
2 Busan 2020-01-22 confirmed 0
3 Busan 2020-01-23 confirmed 0
4 Busan 2020-01-24 confirmed 0
In [40]:
# We can use a variation of a standard RAG color coding:

Amber = "#ffa000"
covid_palette = {
    "deceased": "Red",
    "confirmed": Amber,
    "released": "Green",
    "negative": "Black",
    "test": "Gray",
}
In [41]:
g = sns.FacetGrid(cases_by_province, col="province", col_wrap=5)
g.map(sns.lineplot, "date", "cases", "metric", palette=covid_palette)
g.set_xticklabels("")
g.add_legend()
Out[41]:
<seaborn.axisgrid.FacetGrid at 0x7f0260458400>

Even with so little data, we can already confirm a few things:

  • The number of deceased patients is very low for a country of 51M people
  • The number of deceased patients shows only minor changes over time across all regions
  • Four provinces showed a bump in cases, with Daegu province showing a major explosion of cases.

Let's drill down and compare numbers of confirmed cases across the most impacted regions

  • Daegu
  • Gyeonggi-do
  • Gyeongsangbuk-do
  • Seoul
In [42]:
incrementalCasesPerDay = cases_by_province[
    ["date", "metric", "province", "cases"]
].copy()
incrementalCasesPerDay["province"] = incrementalCasesPerDay["province"].astype(str)

incrementalCasesPerDay = incrementalCasesPerDay[
    incrementalCasesPerDay["province"].isin(
        ["Daegu", "Gyeonggi-do", "Gyeongsangbuk-do", "Seoul"]
    )
]
incrementalCasesPerDay = incrementalCasesPerDay.set_index(
    ["date", "metric", "province"]
)

incrementalCasesPerDay = (
    incrementalCasesPerDay.unstack()  # unstack metric to columns
    .unstack()  # unstack province to columns
    .diff()  # calculate diff with previous row
    .fillna(0)  # set to 0 for first row
    .stack()  # stack province back unto rows
    .stack()  # stack metric back unto rows
)

incrementalCasesPerDay = incrementalCasesPerDay.rename(columns={"cases": "new_cases"})
incrementalCasesPerDay[incrementalCasesPerDay["new_cases"] < 0]
Out[42]:
new_cases
date metric province
2020-03-30 released Seoul -1
2020-04-17 released Gyeongsangbuk-do -7
2020-05-13 released Daegu -9
2020-06-27 deceased Gyeonggi-do -1

It seems there are a few days whose values were adjusted downwards. Nothing major; we won't be changing them or resetting them to 0, and we'll take them at face value.

These things can happen in a global pandemic. Everyone knows that.
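For the record, if we ever did want to floor these downward corrections at zero, pandas could do it in one call (a sketch; we are not applying this to our data):

```python
import pandas as pd

# toy daily-diff series with a couple of downward corrections
new_cases = pd.Series([5, -1, 3, -7])
print(new_cases.clip(lower=0).tolist())  # [5, 0, 3, 0]
```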

Top 4 Provinces¶

Let's plot these 4 provinces so we can zoom and get a better understanding

In [43]:
g = sns.FacetGrid(incrementalCasesPerDay.reset_index(), col="province")
g.map(sns.lineplot, "date", "new_cases", "metric", palette=covid_palette)
charts.rotate_x_labels(g)
g.add_legend()
Out[43]:
<seaborn.axisgrid.FacetGrid at 0x7f02a111d6d0>

Daegu¶

In [44]:
daegu_cases = incrementalCasesPerDay.reset_index()
daegu_cases = daegu_cases[daegu_cases["province"] == "Daegu"]
sns.lineplot(daegu_cases, x="date", y="new_cases", hue="metric", palette=covid_palette)
plt.title("Daegu")
Out[44]:
Text(0.5, 1.0, 'Daegu')

Seoul, Gyeonggi-do and Gyeongsangbuk-do¶

In [45]:
top4_minus_Daegu = incrementalCasesPerDay.reset_index()
top4_minus_Daegu = top4_minus_Daegu[top4_minus_Daegu["province"] != "Daegu"]
g = sns.FacetGrid(top4_minus_Daegu, col="province", col_wrap=3)
g.map(sns.lineplot, "date", "new_cases", "metric", palette=covid_palette)

charts.rotate_x_labels(g)

TimeGender Dataset¶

Let's try to do the same analysis, but by gender instead of province to see if we can spot any major differences.

We're expecting to find no significant differences.

In [46]:
inc_gender = df_timeGender.set_index(["date", "sex"]).unstack()
inc_gender = inc_gender.diff()[
    1:
]  # drop first row. we cannot compare it with the previous value
inc_gender
Out[46]:
confirmed deceased
sex female male female male
date
2020-03-03 381 219 3 3
2020-03-04 330 186 0 4
2020-03-05 285 153 2 1
2020-03-06 322 196 3 4
2020-03-07 306 177 1 1
... ... ... ... ...
2020-06-26 15 24 0 0
2020-06-27 23 28 0 0
2020-06-28 24 38 0 0
2020-06-29 22 20 0 0
2020-06-30 18 25 0 0

120 rows × 4 columns

In [47]:
sns.lineplot(inc_gender["confirmed"], sizes=[5] * 120)
charts.rotate_x_labels()
In [48]:
sns.lineplot(inc_gender["deceased"], sizes=[5] * 120)
charts.rotate_x_labels()

From this chart we can extract a few interesting insights:

  • The initial burst of confirmed cases slowly ramped down to a fraction of its peak after 2 weeks
  • The initial burst affected far more female patients than male patients. What caused this is unclear, for now.
  • Despite the larger number of confirmed cases among female patients at the beginning of March, our figures show that ultimately more male patients died from it.

Each number and peak in this graph represents a tragedy and the loss of a human life; without wanting to minimize any of that, it's important to keep in mind that the population of South Korea is over 51M people, and that these numbers showcase an outstanding performance in terms of damage control.

If anything, the analysis of this dataset reinforces that we are studying a model example of how to handle a pandemic like COVID-19 when it comes to saving lives.

TimeAge Dataset¶

We can also take a quick glance at the impact distribution across age groups

In [49]:
df_timeAge
Out[49]:
date age confirmed deceased
0 2020-03-02 0s 32 0
1 2020-03-02 10s 169 0
2 2020-03-02 20s 1235 0
3 2020-03-02 30s 506 1
4 2020-03-02 40s 633 1
... ... ... ... ...
1084 2020-06-30 40s 1681 3
1085 2020-06-30 50s 2286 15
1086 2020-06-30 60s 1668 41
1087 2020-06-30 70s 850 82
1088 2020-06-30 80s 556 139

1089 rows × 4 columns

In [50]:
daily_timeAge = (
    df_timeAge[["date", "age", "confirmed", "deceased"]]
    .set_index(["date", "age"])
    .unstack()
    .copy()
)
inc_daily_timeAge = daily_timeAge.diff()
inc_daily_timeAge = inc_daily_timeAge.stack().swaplevel().sort_index()
inc_daily_timeAge
Out[50]:
confirmed deceased
age date
0s 2020-03-03 2 0
2020-03-04 0 0
2020-03-05 4 0
2020-03-06 7 0
2020-03-07 7 0
... ... ... ...
80s 2020-06-26 2 0
2020-06-27 2 0
2020-06-28 1 0
2020-06-29 0 0
2020-06-30 0 0

1080 rows × 2 columns

Confirmed¶

In [51]:
def draw_public_holidays(y, **kw):
    # plt.axhline(y=y.max(), color="r", dashes=(2, 1), linewidth=0.4)
    draw_holiday(2020, 3, 1, "Samiljeol", 1)
    draw_holiday(2020, 5, 5, "Eorininal", 2, "right")
    draw_holiday(2020, 5, 7, "Bucheonnim Osinnal", 3)
    draw_holiday(2020, 6, 6, "Hyeonchung-il", 4)


def draw_holiday(
    y: int, m: int, d: int, name: str, i: int, text_align: str = "left"
) -> None:
    plt.axvline(x=u.epoch_for(y, m, d), color="gray", dashes=(1, 10), linewidth=1)
    plt.gca().annotate(
        name,
        (u.epoch_for(y, m, d), 170),
        color="gray",
        weight="ultralight",
        fontsize=9,
        ha=text_align,
        va="top",
        rotation=90,
    )


# sources:
# https://en.wikipedia.org/wiki/Public_holidays_in_South_Korea
# https://en.wikipedia.org/wiki/Buddha%27s_Birthday

g = sns.relplot(
    data=inc_daily_timeAge.reset_index(),
    col="age",
    kind="line",
    x="date",
    y="confirmed",
    col_wrap=3,
)

g = g.map(draw_public_holidays, "confirmed")

g.fig.subplots_adjust(top=0.9)
g.fig.suptitle("Confirmed cases per age group")

charts.rotate_x_labels(g)

Deceased¶

In [52]:
g = sns.relplot(
    data=inc_daily_timeAge.reset_index(),
    col="age",
    kind="line",
    x="date",
    y="deceased",
    col_wrap=5,
)
g = g.map(
    lambda y, **kw: plt.axhline(y=y.median(), color="w", dashes=(2, 1), linewidth=0.4),
    "deceased",
)

g.fig.subplots_adjust(top=0.9)
g.fig.suptitle("Deceased cases per age group")

charts.rotate_x_labels(g)

We can extract a few conclusions from this initial peek:

  • The most impacted population in terms of confirmed infections is skewed towards the younger generations, particularly the 20s
  • The most impacted population in terms of deceased cases is skewed towards the older generations, particularly the 80s

Time Dataset¶

Let's take a look at the country-wide timeline to see if any obvious facts emerge from it.

In [53]:
time_cases_per_type = df_time.melt(
    id_vars=["date"], var_name="type", value_name="cases"
)
time_cases_per_type.head()
Out[53]:
date type cases
0 2020-01-20 test 1
1 2020-01-21 test 1
2 2020-01-22 test 4
3 2020-01-23 test 22
4 2020-01-24 test 27
In [54]:
sns.lineplot(
    data=time_cases_per_type, x="date", y="cases", hue="type", palette=covid_palette
)

plt.gca().add_patch(
    patches.Rectangle(
        (u.epoch_for(2020, 1, 30), -20000),
        150,
        80000,
        edgecolor="darkblue",
        facecolor="#f1f3ff",
        fill=True,
        lw=0.5,
    )
)

plt.ylabel("cases (in millions)")
plt.title("South Korea - Covid-19 - Country-wide Timeline")
Out[54]:
Text(0.5, 1.0, 'South Korea - Covid-19 - Country-wide Timeline')

South Korea managed to test over a million people during the first few months of the pandemic, and the vast majority of them were confirmed negative cases.

Let's zoom into the bottom of the chart so we can see the confirmed/released/deceased cases a bit better

In [55]:
noneg_notest = time_cases_per_type[
    ~time_cases_per_type["type"].isin(["test", "negative"])
]
sns.lineplot(data=noneg_notest, x="date", y="cases", hue="type", palette=covid_palette)

plt.ylabel("cases")
plt.title("South Korea - Covid-19 - Cumulative Country-wide Timeline")
Out[55]:
Text(0.5, 1.0, 'South Korea - Covid-19 - Cumulative Country-wide Timeline')

South Korea as a role model for this pandemic

There are various elements that can bring us to this conclusion:

  • We can see that South Korea has been very proactive, with early testing
  • We can also see that the overall number of deceased cases has been relatively low, compared to other countries.

In future analyses we will try to find the patterns behind these much-better-than-average numbers, so we can implement a similar policy in our country.

Weather dataset¶

We suspect that the searchterms dataframe will have a seasonal component to its patterns, so let's visualize Korea's natural temperature cycles to get a better understanding.

Since South Korea has a temperate climate with a wide range of temperatures, we will use a rolling-average to smooth out the temperatures, so we can see the general trend instead of the precise values.

In [56]:
temperatures = df_weather[
    ["province", "date", "min_temp", "avg_temp", "max_temp"]
].rename(columns={"min_temp": "min", "avg_temp": "avg", "max_temp": "max"})
In [57]:
korea_temperatures = temperatures[["province", "date", "avg"]]
temps_rolling = (
    korea_temperatures.set_index(["date", "province"])
    .unstack("province")
    .rolling(30, center=True)
    .mean()
    .stack("province")
)

province_palette = {p: "#b7b7b7" for p in korea_temperatures["province"].unique()}
province_width = {p: 0.1 for p in korea_temperatures["province"].unique()}

province_palette["Seoul"] = "#ff7c00"
province_width["Seoul"] = 1.5

plt.figure(figsize=(10, 6))
sns.lineplot(
    data=temps_rolling.reset_index(),
    x="date",
    y="avg",
    hue="province",
    palette=province_palette,
    size="province",
    sizes=province_width,
    legend=False,
)
plt.title("Avg Temperature (Seoul vs other provinces)")
Out[57]:
Text(0.5, 1.0, 'Avg Temperature (Seoul vs other provinces)')

Seoul has more extreme temperatures than the rest of the country, which makes it a good candidate for spotting seasonal changes (winter/summer).

Let's drill down one more level and see the temperature range

In [58]:
seoul_temperatures = temperatures[temperatures["province"] == "Seoul"].drop(
    columns="province"
)

temps_palette = {
    "min": sns.color_palette()[0],
    "max": sns.color_palette()[1],
    "avg": sns.color_palette()[2],
}

rolling_window_size = 21
seoul_rolling = seoul_temperatures.rolling(
    rolling_window_size, on="date", center=True
).mean()
seoul_rolling = seoul_rolling.melt(
    id_vars=["date"], var_name="measure", value_name="temperature"
)
plt.figure(figsize=(20, 5))
sns.lineplot(
    data=seoul_rolling,
    x="date",
    hue="measure",
    y="temperature",
    linewidth=0.3,
    palette=temps_palette,
    hue_order=["max", "avg", "min"],
)

plt.axhline(y=0, color="0", linewidth=0.4)
plt.title(f"Temperature in Seoul (rolling avg. over {rolling_window_size} days)")
Out[58]:
Text(0.5, 1.0, 'Temperature in Seoul (rolling avg. over 21 days)')

After seeing this, we could use the months with the lowest temperatures as "winter period" for other dataset's analysis
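As a rough sketch of that idea, the "winter" months could be derived programmatically as the coldest calendar months. The snippet below uses synthetic daily temperatures as a stand-in for the real seoul_temperatures frame (the sinusoidal profile and the choice of 3 months are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Synthetic daily average temperatures standing in for df_weather's Seoul
# rows (assumption: the real analysis would use seoul_temperatures instead).
dates = pd.date_range("2019-01-01", "2019-12-31", freq="D")
rng = np.random.default_rng(42)
avg = 12 - 15 * np.cos(2 * np.pi * (dates.dayofyear - 15) / 365) + rng.normal(0, 1, len(dates))
temps = pd.DataFrame({"date": dates, "avg": avg})

# Mean temperature per calendar month; label the 3 coldest as "winter"
monthly = temps.groupby(temps["date"].dt.month)["avg"].mean()
winter_months = sorted(monthly.nsmallest(3).index)
print(winter_months)
```

The same grouping would work on the real data, swapping the synthetic frame for the Seoul slice of df_weather.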

Now we're ready to compare these temperature changes with search trends

SearchTrend Dataset¶

In [59]:
searchterms_palette = {
    "cold": sns.color_palette()[0],
    "flu": sns.color_palette()[1],
    "pneumonia": sns.color_palette()[2],
    "coronavirus": sns.color_palette()[3],
}

dailyhits = df_searchTrend.melt(id_vars=["date"], var_name="term", value_name="hits")

g = sns.relplot(
    data=dailyhits.reset_index(),
    col="term",
    kind="line",
    x="date",
    y="hits",
    col_wrap=4,
    hue="term",
    palette=searchterms_palette,
)

g.fig.subplots_adjust(top=0.9)
g.fig.suptitle("Trending searches")

charts.rotate_x_labels(g)

Let's zoom into the lower part of the chart so we can see patterns or trends a bit more closely.

We'll also want to paint "winter" periods so we can more easily detect any seasonal behaviours related to cold/flus

In [60]:
upper_limit = 3.5
lower_limit = -0.05
# we want to highlight that covid searches are 0 for
# most of the years and then shoot up in 2020


def paint_winters(series, **kw):
    paint_winter(u.epoch_for(2016, 1, 1))
    paint_winter(u.epoch_for(2017, 1, 1))
    paint_winter(u.epoch_for(2018, 1, 1))
    paint_winter(u.epoch_for(2019, 1, 1))
    paint_winter(u.epoch_for(2020, 1, 1))


def paint_winter(day):
    plt.gca().add_patch(
        patches.Rectangle(
            (day - 30, lower_limit),
            90,
            upper_limit - lower_limit,
            edgecolor="darkblue",
            facecolor="#f8f8f8",
            fill=True,
            lw=0.5,
        )
    )
    plt.gca().annotate(
        "winter",
        (day + 15, upper_limit * 0.99),
        color="black",
        weight="ultralight",
        fontsize=9,
        ha="center",
        va="top",
    )


g = sns.FacetGrid(
    dailyhits.reset_index(),
    col="term",
    col_wrap=2,
    sharey=False,
    height=3,
    aspect=3,
    ylim=(lower_limit, upper_limit),
    xlim=(u.epoch_for(2016, 1, 1), u.epoch_for(2020, 7, 30)),
)


g.map(sns.lineplot, "date", "hits", "term", palette=searchterms_palette)

g = g.map(paint_winters, "hits")

charts.rotate_x_labels(g, 90, 2)

In this quick visualization we can spot some interesting patterns:

  • FLU shows a very strong seasonality
  • COLD shows no seasonality, probably because cold is a polysemic word, and might have been used to refer to temperature (weather or otherwise)
  • CORONAVIRUS shows almost no hits until the beginning of the pandemic. It's unclear what the initial hit at the end of 2018 indicates.
    • It's very likely to be an outlier, given that there's only a 1-day hit with such a high value (3.16772)
    • Since none of our other datasets cover 2018, we will not be able to make any educated guesses
    • Alternatively, it could be a sudden spike after a 2018 outbreak of a different strain of coronavirus
  • PNEUMONIA shows minor seasonality (much weaker than FLU's), but it shows another interesting pattern:
    • its search hits were off the charts at the same time as the CORONAVIRUS term.
    • it might indicate that lots of people had symptoms but did not know it was COVID and looked for the closest concept/term.
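A quick way to sanity-check the outlier theory is to flag values that tower over their local neighbourhood. This sketch uses a toy series (the real check would run against the coronavirus column of df_searchTrend, and the 1.0 threshold is an arbitrary assumption):

```python
import pandas as pd

# Toy daily "hits" series with one isolated spike, standing in for the
# coronavirus search-trend column (assumed column name).
hits = pd.Series([0.02] * 30)
hits.iloc[15] = 3.17  # a single-day value similar to the late-2018 one

# Flag values far above the local (rolling) median: isolated one-day
# spikes stand out, while sustained trends do not.
local_median = hits.rolling(7, center=True, min_periods=1).median()
spikes = hits[hits > local_median + 1.0]
print(list(spikes.index))
```

A rolling median is preferable to a rolling mean here, since a single extreme value barely moves the median of its own window.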

If we decide to use search terms in future analysis, we will likely need to merge the CORONAVIRUS and PNEUMONIA hits for the year 2020, as that spike in PNEUMONIA searches most likely stands in for CORONAVIRUS from before the term became widely known
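A minimal sketch of that merge, on toy rows in place of df_searchTrend (the column names and the 2020 cutoff are taken from the discussion above; the values are invented):

```python
import pandas as pd

# Toy weekly search-trend rows (assumption: the real frame is
# df_searchTrend with these column names).
trend = pd.DataFrame(
    {
        "date": pd.to_datetime(["2019-12-30", "2020-01-06", "2020-01-13"]),
        "coronavirus": [0.0, 1.0, 2.0],
        "pneumonia": [0.5, 1.0, 1.5],
    }
)

# From 2020 onward, count pneumonia searches toward the covid signal;
# before 2020 they keep their ordinary meaning and are left out.
in_2020 = trend["date"] >= "2020-01-01"
trend["covid_related"] = trend["coronavirus"] + trend["pneumonia"].where(in_2020, 0.0)
print(trend["covid_related"].tolist())
```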

Policy dataset¶

Something else we might be able to use for our timeline analysis is the Policy Dataset.

This dataset contains the South Korean policies enacted during the initial months of the Covid pandemic.

The hope is that we can find one or more policies that helped curb the confirmed cases shown in the previous trends.

In [61]:
policies = df_policy.drop(columns=["policy_id", "country"])
policies["end_date"] = policies["end_date"].fillna(dataset_creation_date)
policies["policy_detail"] = (
    policies["gov_policy"].str.slice(0, 30)
    + " - "
    + policies["detail"].str.slice(0, 30)
)
policies.head()
Out[61]:
type gov_policy detail start_date end_date policy_detail
0 Alert Infectious Disease Alert Level Level 1 (Blue) 2020-01-03 2020-01-19 Infectious Disease Alert Level - Level 1 (Blue)
1 Alert Infectious Disease Alert Level Level 2 (Yellow) 2020-01-20 2020-01-27 Infectious Disease Alert Level - Level 2 (Yellow)
2 Alert Infectious Disease Alert Level Level 3 (Orange) 2020-01-28 2020-02-22 Infectious Disease Alert Level - Level 3 (Orange)
3 Alert Infectious Disease Alert Level Level 4 (Red) 2020-02-23 2020-05-31 Infectious Disease Alert Level - Level 4 (Red)
4 Immigration Special Immigration Procedure from China 2020-02-04 2020-05-31 Special Immigration Procedure - from China
In [62]:
default_palette = px.colors.qualitative.Plotly
policy_palette = {
    "Alert": "red",
    "Immigration": Amber,
    "Health": default_palette[2],
    "Social": default_palette[3],
    "Education": default_palette[5],
    "Administrative": default_palette[7],
    "Technology": "#9a9a9a",
    "Transformation": "#54a24b",
}

fig = px.timeline(
    policies,
    x_start="start_date",
    x_end="end_date",
    y="policy_detail",
    opacity=0.5,
    facet_col_spacing=1,
    color="type",
    height=1500,
    text="detail",
    color_discrete_map=policy_palette,
)

fig.show()

Developing Hypotheses¶

Now that we have a general idea of the datasets we have, we can start looking into the data to find patterns, and see if our expectations hold true.

Given that we will not be running statistical experiments, we will not call them "Hypotheses"; we will call them "Expectations" and try to validate them with the data we have.

Expectations¶

Here's a list of general insights we expect to find in the dataset provided.

  1. [x] Strong social distancing policies helped reduce the number of infections
  2. [x] The elderly population is most at risk of serious illness or death if they contract COVID-19
  3. [x] Young adults are the largest spreaders of confirmed cases
  4. [x] Most of the infections (hotspots) begin around Seoul and adjacent cities.

Social Distancing Policies¶

Expectation:

Strong social distancing policies helped reduce the number of infections

Let's take a look at the timelines of social distancing policies against the number of cases. Let's start by age group, and then by province.

By Age Group¶

In [63]:
df_social_distancing = policies[policies["gov_policy"] == "Social Distancing Campaign"]

incubation_days_from = 2
incubation_days_until = 14
incubation_period = incubation_days_from  # shift policy windows by the minimum incubation time


def draw_policies(y, policy_height=170, **kw):
    # plt.axhline(y=y.max(), color="r", dashes=(2, 1), linewidth=0.4)
    draw_policy(df_social_distancing.iloc[0], policy_height)
    draw_policy(df_social_distancing.iloc[1], policy_height)
    draw_policy(df_social_distancing.iloc[2], policy_height)
    draw_policy(df_social_distancing.iloc[3], policy_height)


def draw_policy(policy, policy_height: int) -> None:
    start = policy["start_date"]
    start_date = u.epoch_for(start.year, start.month, start.day) + incubation_period
    duration = (policy["end_date"] - policy["start_date"]) / np.timedelta64(1, "D")
    color = "#f69e87" if policy["detail"] == "Strong" else "#fddb99"
    plt.gca().add_patch(
        patches.Rectangle(
            (start_date, 0),
            duration,
            policy_height + 5,
            facecolor=color,
            fill=True,
            lw=0.5,
        )
    )
    plt.gca().annotate(
        policy["detail"],
        (start_date + 2, policy_height),
        color="white",
        fontsize=10,
        va="top",
        rotation=90,
    )


def draw_policy_custom_height(policy_height: int):
    def curried(policy, **kw):
        draw_policies(policy, policy_height, **kw)

    return curried
In [64]:
g = sns.relplot(
    data=inc_daily_timeAge.reset_index(),
    col="age",
    kind="line",
    x="date",
    y="confirmed",
    col_wrap=3,
)

g = g.map(draw_policies, "confirmed")

g.fig.subplots_adjust(top=0.9)
g.fig.suptitle("Confirmed cases per age group")

charts.rotate_x_labels(g)

There seems to be some indication that strong social distancing policies contributed to a reduction in confirmed cases across all age groups. The decrease in cases is especially notable under "Strong" social distancing policies.

By Province¶

Let's do the same analysis by province, just to be extra sure.

We expect the same downward trend in most if not all of them, ideally with a strong decline in the top 4 provinces we identified earlier.

In [65]:
daily_timeProvince = (
    df_timeProvince[["date", "province", "confirmed", "deceased"]]
    .set_index(["date", "province"])
    .unstack()
    .copy()
)
inc_daily_timeProvince = daily_timeProvince.diff()
inc_daily_timeProvince = inc_daily_timeProvince.stack().swaplevel().sort_index()
inc_daily_timeProvince.head(5)
Out[65]:
confirmed deceased
province date
Busan 2020-01-21 0 0
2020-01-22 0 0
2020-01-23 0 0
2020-01-24 0 0
2020-01-25 0 0
In [66]:
g = sns.relplot(
    data=inc_daily_timeProvince.reset_index(),
    col="province",
    kind="line",
    x="date",
    y="confirmed",
    col_wrap=4,
)

g = g.map(draw_policy_custom_height(600), "confirmed")

g.fig.subplots_adjust(top=0.95)
g.fig.suptitle("Confirmed cases per province")

charts.rotate_x_labels(g)

Just as we expected, the policy windows overlap in time with the strong decline across all the provinces and, more importantly, across the top 4 impacted regions that we identified earlier.

Conclusion: There seems to be a reduction in cases during the periods where Strong social distancing policies were in place, both when looking at provinces and when looking at age ranges.

This is, however, not enough to make categorical claims, as there could have been other factors at play (non-government policies, workplace/social changes, etc.). We can keep it as a possible contributing factor, since the current data does not allow us to discard it.
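One simple way to quantify this possible contribution is to compare mean daily cases inside and outside a policy window. This is only a sketch on toy counts; the real comparison would use the incremental time series together with df_social_distancing's start_date/end_date columns:

```python
import pandas as pd

# Toy daily confirmed counts (assumption: the real comparison would use
# inc_daily_timeAge / inc_daily_timeProvince).
days = pd.date_range("2020-03-01", periods=10, freq="D")
confirmed = pd.Series([50, 60, 55, 40, 20, 15, 10, 12, 8, 9], index=days)

# Hypothetical "Strong" social-distancing window
start, end = pd.Timestamp("2020-03-05"), pd.Timestamp("2020-03-10")
inside = confirmed[(confirmed.index >= start) & (confirmed.index <= end)]
outside = confirmed.drop(inside.index)

print(round(inside.mean(), 1), round(outside.mean(), 1))
```

Even with such a comparison, a lower mean inside the window only shows correlation; it does not rule out the confounders mentioned above.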

Which population is most at risk¶

Expectation:

The elderly population (people aged 65 and above) is most at risk of serious illness or death if they contract COVID-19

We don't have any data regarding "serious illness", so we can only track deceased statistics, but we have reason to believe that the same patterns would emerge if we had data on serious illness or health complications.

At the very least, this vector of exploration will give us a "best case scenario", showing us "the least terrible picture possible in an ideal world".

Let's divide the population into clusters, depending on their age:

  • young people: (0 to 29)
  • middle aged people: (30 to 69)
  • 70s: between 70-79
  • 80s: 80 and above
In [67]:
def age_group_mapper(age):
    if age in ["70s", "80s"]:
        return age
    elif age in ["30s", "40s", "50s", "60s"]:
        return "middle-aged people"
    elif age in ["0s", "10s", "20s"]:
        return "young people"
    else:
        return "unknown age group"


df_timeAge["age_group"] = df_timeAge["age"].map(age_group_mapper)
df_timeAge.head(10)
Out[67]:
date age confirmed deceased age_group
0 2020-03-02 0s 32 0 young people
1 2020-03-02 10s 169 0 young people
2 2020-03-02 20s 1235 0 young people
3 2020-03-02 30s 506 1 middle-aged people
4 2020-03-02 40s 633 1 middle-aged people
5 2020-03-02 50s 834 5 middle-aged people
6 2020-03-02 60s 530 6 middle-aged people
7 2020-03-02 70s 192 6 70s
8 2020-03-02 80s 81 3 80s
9 2020-03-03 0s 34 0 young people
In [68]:
df_timeAge["deceased_ratio"] = 100 * df_timeAge["deceased"] / df_timeAge["confirmed"]
df_timeAge.head(10)
Out[68]:
date age confirmed deceased age_group deceased_ratio
0 2020-03-02 0s 32 0 young people 0.0
1 2020-03-02 10s 169 0 young people 0.0
2 2020-03-02 20s 1235 0 young people 0.0
3 2020-03-02 30s 506 1 middle-aged people 0.197628
4 2020-03-02 40s 633 1 middle-aged people 0.157978
5 2020-03-02 50s 834 5 middle-aged people 0.59952
6 2020-03-02 60s 530 6 middle-aged people 1.132075
7 2020-03-02 70s 192 6 70s 3.125
8 2020-03-02 80s 81 3 80s 3.703704
9 2020-03-03 0s 34 0 young people 0.0
In [69]:
sns.lineplot(
    data=df_timeAge, x="date", y="deceased_ratio", hue="age_group", errorbar=None
)
plt.ylabel("decease ratio (%)")

charts.rotate_x_labels()

Conclusion: There seems to be a clear trend, with a clear separation between the 70s and 80s groups. We cannot discard age as a factor related to the severity of symptoms.

As we pointed out earlier, this graph only tracks mortality, which means the full impact (deceased + serious illness + life-changing complications) will be worse than what we can diagram here.

Infection vectors¶

Expectation:

Young adults are the largest spreaders of confirmed cases

By Age Group¶

For this analysis we want to review case data to see if we can trace infections through the population and detect which population segment is the largest spreader

Seeing how the 20s are the group with the most infections, we suspect that they might be the ones helping this virus spread around.

Let's look at direct infection cases to confirm/reject this expectation.

Locating the victims of infections

In [70]:
patient_contact = (
    df_patientInfo[df_patientInfo["infection_case"] == "contact with patient"]
    .reset_index()
    .copy()
)
patient_contact.head()
Out[70]:
patient_id sex age country province city infection_case infected_by confirmed_date released_date deceased_date state
0 1000000003 male 50s Korea Seoul Jongno-gu contact with patient 2002000001 2020-01-30 2020-02-19 NaT released
1 1000000005 female 20s Korea Seoul Seongbuk-gu contact with patient 1000000002 2020-01-31 2020-02-24 NaT released
2 1000000006 female 50s Korea Seoul Jongno-gu contact with patient 1000000003 2020-01-31 2020-02-19 NaT released
3 1000000007 male 20s Korea Seoul Jongno-gu contact with patient 1000000003 2020-01-31 2020-02-10 NaT released
4 1000000010 female 60s Korea Seoul Seongbuk-gu contact with patient 1000000003 2020-02-05 2020-02-29 NaT released

Locating the infectors

In [71]:
infector_lookup = df_patientInfo.reset_index()[
    ["patient_id", "age", "sex", "province", "city"]
].copy()
infector_lookup["patient_id"] = infector_lookup["patient_id"].astype(str)
infector_lookup = infector_lookup.add_prefix("infector_")
infector_lookup["infector_age_group"] = infector_lookup["infector_age"].map(
    age_group_mapper
)
infector_lookup.head()
Out[71]:
infector_patient_id infector_age infector_sex infector_province infector_city infector_age_group
0 1000000001 50s male Seoul Gangseo-gu middle-aged people
1 1000000002 30s male Seoul Jungnang-gu middle-aged people
2 1000000003 50s male Seoul Jongno-gu middle-aged people
3 1000000004 20s male Seoul Mapo-gu young people
4 1000000005 20s female Seoul Seongbuk-gu young people

Merging the data

In [72]:
infection_depth_1 = patient_contact.merge(
    infector_lookup, left_on="infected_by", right_on="infector_patient_id"
)
infection_depth_1
Out[72]:
patient_id sex age country province city infection_case infected_by confirmed_date released_date deceased_date state infector_patient_id infector_age infector_sex infector_province infector_city infector_age_group
0 1000000005 female 20s Korea Seoul Seongbuk-gu contact with patient 1000000002 2020-01-31 2020-02-24 NaT released 1000000002 30s male Seoul Jungnang-gu middle-aged people
1 1000000006 female 50s Korea Seoul Jongno-gu contact with patient 1000000003 2020-01-31 2020-02-19 NaT released 1000000003 50s male Seoul Jongno-gu middle-aged people
2 1000000007 male 20s Korea Seoul Jongno-gu contact with patient 1000000003 2020-01-31 2020-02-10 NaT released 1000000003 50s male Seoul Jongno-gu middle-aged people
3 1000000010 female 60s Korea Seoul Seongbuk-gu contact with patient 1000000003 2020-02-05 2020-02-29 NaT released 1000000003 50s male Seoul Jongno-gu middle-aged people
4 1000000017 male 70s Korea Seoul Jongno-gu contact with patient 1000000003 2020-02-20 2020-03-01 NaT released 1000000003 50s male Seoul Jongno-gu middle-aged people
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1225 6100000082 male 50s Korea Gyeongsangnam-do Geochang-gun contact with patient 6100000068 2020-03-07 2020-03-19 NaT released 6100000068 60s female Gyeongsangnam-do Geochang-gun middle-aged people
1226 6100000079 male 20s Korea Gyeongsangnam-do Changnyeong-gun contact with patient 6100000076 2020-03-06 NaT NaT released 6100000076 20s male Gyeongsangnam-do Changnyeong-gun young people
1227 6100000111 male 20s Korea Gyeongsangnam-do Sacheon-si contact with patient 6100000108 2020-04-06 NaT NaT released 6100000108 10s male Gyeongsangnam-do Sacheon-si young people
1228 6100000112 male 60s Korea Gyeongsangnam-do Hapcheon-gun contact with patient 6100000100 2020-04-07 NaT NaT released 6100000100 60s female Gyeongsangnam-do Jinju-si middle-aged people
1229 7000000011 male 30s Korea Jeju-do Jeju-do contact with patient 7000000009 2020-04-03 2020-05-19 NaT released 7000000009 20s female Jeju-do Jeju-do young people

1230 rows × 18 columns

In [73]:
sns.countplot(data=infection_depth_1, x="infector_age")
charts.rotate_x_labels()

It seems that our initial expectation was misguided. While we cannot confirm this with certainty, due to the small number of datapoints we have, we can likely reject the idea that the young (20s or less) are doing the spreading. The chart above suggests that we were off by an entire generation.

This could be for various reasons:

  • Maybe teenagers and young adults are doing online classes and are not going out as much as it seemed
  • Maybe the 40s/50s groups have jobs that cannot be done remotely
  • or maybe testing in young adults is overrepresented, and our first glance at the data led us to incorrectly assume that all age groups were tested equally, when reality was not that simple.
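The last caveat could be probed by normalizing infector counts per capita. This is only a sketch: the population figures below are invented for illustration, since the dataset includes no population table, and the counts stand in for tallies from infection_depth_1:

```python
import pandas as pd

# Toy infector counts per age group, with hypothetical population sizes
# (assumption: real counts would come from infection_depth_1; the
# population figures are invented for illustration).
infector_counts = pd.Series({"20s": 120, "40s": 200, "50s": 190})
population = pd.Series({"20s": 6_800_000, "40s": 8_200_000, "50s": 8_600_000})

# Infectors per 100k inhabitants: corrects for unequal group sizes
per_100k = (100_000 * infector_counts / population).round(2)
print(per_100k.to_dict())
```

If the per-capita picture matched the raw counts, the testing-bias explanation would look less likely.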

Let's visualize this over 2 dimensions to see the infector/infected vectors (in case there are any hidden patterns)

In [74]:
infection_count = pd.DataFrame(
    infection_depth_1[["age", "infector_age"]].value_counts()
).reset_index()
infection_count = infection_count.pivot(index="age", columns="infector_age", values=0)
infection_count = infection_count.drop(columns=["unknown"], index=["unknown"])
infection_count = infection_count.sort_index().sort_index(axis=1)
infection_count
Out[74]:
infector_age 0s 10s 20s 30s 40s 50s 60s 70s 80s 90s
age
0s 1.0 NaN NaN 15.0 8.0 NaN NaN NaN NaN NaN
10s NaN 16.0 9.0 4.0 19.0 3.0 2.0 3.0 NaN NaN
20s NaN 6.0 38.0 13.0 17.0 24.0 4.0 14.0 3.0 NaN
30s 1.0 2.0 9.0 38.0 21.0 10.0 10.0 9.0 4.0 NaN
40s 1.0 5.0 7.0 12.0 63.0 27.0 6.0 9.0 2.0 NaN
50s 1.0 7.0 29.0 11.0 29.0 49.0 23.0 25.0 11.0 1.0
60s NaN NaN 5.0 17.0 12.0 15.0 47.0 21.0 5.0 1.0
70s NaN 1.0 3.0 2.0 4.0 8.0 11.0 10.0 15.0 NaN
80s NaN NaN 3.0 NaN 2.0 2.0 6.0 7.0 11.0 1.0
90s NaN NaN NaN 1.0 NaN NaN NaN 1.0 7.0 NaN
In [75]:
ax = sns.heatmap(data=infection_count, cmap="magma_r", annot=True)
ax.set(xlabel="infector", ylabel="infected")
ax.xaxis.tick_top()
ax.xaxis.set_label_position("top")

No obvious patterns emerge other than the primary diagonal, which seems to indicate that most of the infections occur within the same age group.

Some age groups also show secondary cross-group infections with groups 30 years apart:

  • 50s to 20s
  • 20s to 50s
  • 40s to 10s
  • 60s to 30s

We suspect this is due to Korea-specific demographics and represents the age gap between parents and children (infections within the same family).

The average mother's age at first birth in South Korea is around 33.
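To back the "primary diagonal" observation with a number, we could compute the share of pairs where infected and infector fall in the same age bracket. A minimal sketch on toy pairs (the real input would be infection_depth_1's age columns):

```python
import pandas as pd

# Toy stand-in for infection_depth_1's (age, infector_age) pairs
pairs = pd.DataFrame(
    {
        "age": ["20s", "50s", "20s", "40s", "50s", "10s"],
        "infector_age": ["20s", "50s", "50s", "40s", "20s", "40s"],
    }
)

# Share of infections where infected and infector share an age bracket
same_group_share = (pairs["age"] == pairs["infector_age"]).mean()
print(same_group_share)
```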

By Province¶

In [76]:
infection_by_city = infection_depth_1[
    ["province", "city", "infector_province", "infector_city"]
].copy()
infection_by_city["infected"] = infection_by_city["province"].str.cat(
    infection_by_city["city"], sep="/"
)
infection_by_city["infector"] = infection_by_city["infector_province"].str.cat(
    infection_by_city["infector_city"], sep="/"
)
infection_by_city.head()
Out[76]:
province city infector_province infector_city infected infector
0 Seoul Seongbuk-gu Seoul Jungnang-gu Seoul/Seongbuk-gu Seoul/Jungnang-gu
1 Seoul Jongno-gu Seoul Jongno-gu Seoul/Jongno-gu Seoul/Jongno-gu
2 Seoul Jongno-gu Seoul Jongno-gu Seoul/Jongno-gu Seoul/Jongno-gu
3 Seoul Seongbuk-gu Seoul Jongno-gu Seoul/Seongbuk-gu Seoul/Jongno-gu
4 Seoul Jongno-gu Seoul Jongno-gu Seoul/Jongno-gu Seoul/Jongno-gu

Let's also trim down the dataset. Let's exclude intra-province infections and see if there are any hot trans-province trends.

In [77]:
infection_by_city_count = pd.DataFrame(infection_by_city.value_counts()).reset_index()
infection_by_city_count = infection_by_city_count[
    infection_by_city_count["infected"] != infection_by_city_count["infector"]
]
infection_by_city_count.head()
Out[77]:
province city infector_province infector_city infected infector 0
11 Chungcheongnam-do Cheonan-si Chungcheongnam-do Asan-si Chungcheongnam-do/Cheonan-si Chungcheongnam-do/Asan-si 18
16 Gyeonggi-do Bucheon-si Incheon Bupyeong-gu Gyeonggi-do/Bucheon-si Incheon/Bupyeong-gu 11
21 Incheon Michuhol-gu Incheon Bupyeong-gu Incheon/Michuhol-gu Incheon/Bupyeong-gu 9
26 Gyeonggi-do Gwangju-si Gyeonggi-do Seongnam-si Gyeonggi-do/Gwangju-si Gyeonggi-do/Seongnam-si 8
29 Gyeonggi-do Bucheon-si Seoul Nowon-gu Gyeonggi-do/Bucheon-si Seoul/Nowon-gu 7
In [78]:
infection_by_city_count = infection_by_city_count.pivot(
    index="infected", columns="infector", values=0
)
In [79]:
plt.figure(figsize=(10, 10))
ax = sns.heatmap(data=infection_by_city_count, cmap="magma_r")
ax.set(xlabel="infector", ylabel="infected")
ax.xaxis.tick_top()
ax.xaxis.set_label_position("top")

plt.title("trans-city infections")
charts.rotate_x_labels()

We can see some interesting insights:

  • We have removed intra-city infections
  • We still see 4 clusters:
    • Seoul to Gyeonggi province infections
    • Intra-Gyeonggi province infections
    • Intra-Seoul infections
    • Intra-Incheon infections
In [80]:
infection_by_province_count = pd.DataFrame(
    infection_by_city.drop(
        columns=["infected", "infector", "city", "infector_city"]
    ).value_counts()
).reset_index()
infection_by_province_count.head()
Out[80]:
province infector_province 0
0 Gyeonggi-do Gyeonggi-do 486
1 Incheon Incheon 145
2 Gyeonggi-do Seoul 109
3 Seoul Seoul 106
4 Gyeongsangbuk-do Gyeongsangbuk-do 98
In [81]:
infection_by_province_count = infection_by_province_count.pivot(
    index="province", columns="infector_province", values=0
)
infection_by_province_count = infection_by_province_count.sort_index().sort_index(
    axis=1
)
infection_by_province_count
Out[81]:
infector_province Busan Chungcheongbuk-do Chungcheongnam-do Daegu Daejeon Gwangju Gyeonggi-do Gyeongsangbuk-do Gyeongsangnam-do Incheon Jeju-do Jeollabuk-do Sejong Seoul Ulsan
province
Busan 25.0 NaN NaN NaN NaN NaN NaN NaN 1.0 NaN NaN NaN NaN NaN NaN
Chungcheongbuk-do NaN 8.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
Chungcheongnam-do NaN NaN 89.0 NaN 4.0 NaN 1.0 NaN 1.0 NaN NaN NaN 1.0 1.0 NaN
Daegu NaN NaN NaN 3.0 1.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
Daejeon NaN NaN 2.0 1.0 38.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
Gwangju NaN NaN NaN NaN NaN 16.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN
Gyeonggi-do 1.0 NaN NaN 4.0 4.0 NaN 486.0 NaN NaN 30.0 NaN NaN NaN 109.0 NaN
Gyeongsangbuk-do NaN NaN NaN 3.0 NaN NaN NaN 98.0 NaN NaN NaN NaN NaN NaN NaN
Gyeongsangnam-do NaN NaN NaN 1.0 NaN NaN NaN NaN 22.0 NaN NaN NaN NaN NaN NaN
Incheon NaN NaN NaN NaN NaN NaN NaN NaN NaN 145.0 NaN NaN NaN NaN NaN
Jeju-do NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 1.0 NaN NaN NaN NaN
Jeollabuk-do NaN NaN NaN NaN 1.0 1.0 NaN NaN NaN NaN NaN 2.0 NaN NaN NaN
Jeollanam-do NaN NaN NaN NaN NaN 2.0 NaN NaN NaN NaN NaN NaN NaN NaN 1.0
Sejong NaN NaN NaN NaN 2.0 NaN NaN NaN NaN NaN NaN NaN 6.0 NaN NaN
Seoul NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 106.0 1.0
Ulsan NaN NaN NaN NaN NaN NaN NaN 1.0 NaN NaN NaN NaN NaN NaN 11.0
In [82]:
ax = sns.heatmap(data=infection_by_province_count, cmap="magma_r")
ax.set(xlabel="infector", ylabel="infected")
ax.xaxis.tick_top()
ax.xaxis.set_label_position("top")

plt.title("infection vectors by province")
charts.rotate_x_labels()

This chart shows the same conclusions as before, but at a lower resolution, providing more clarity and reducing noise.

We've included same-province infections so we can confirm the expectations we had when we did the trans-city analysis.

Let's try to visualize these hotspots over a real map, so we can get a better understanding.

Geographic spread¶

Expectation:

Most of the infections (hotspots) begin around Seoul and adjacent cities.

We will need to visualize geographical data to make some sense of this

In [83]:
infection_depth_1.head()
Out[83]:
patient_id sex age country province city infection_case infected_by confirmed_date released_date deceased_date state infector_patient_id infector_age infector_sex infector_province infector_city infector_age_group
0 1000000005 female 20s Korea Seoul Seongbuk-gu contact with patient 1000000002 2020-01-31 2020-02-24 NaT released 1000000002 30s male Seoul Jungnang-gu middle-aged people
1 1000000006 female 50s Korea Seoul Jongno-gu contact with patient 1000000003 2020-01-31 2020-02-19 NaT released 1000000003 50s male Seoul Jongno-gu middle-aged people
2 1000000007 male 20s Korea Seoul Jongno-gu contact with patient 1000000003 2020-01-31 2020-02-10 NaT released 1000000003 50s male Seoul Jongno-gu middle-aged people
3 1000000010 female 60s Korea Seoul Seongbuk-gu contact with patient 1000000003 2020-02-05 2020-02-29 NaT released 1000000003 50s male Seoul Jongno-gu middle-aged people
4 1000000017 male 70s Korea Seoul Jongno-gu contact with patient 1000000003 2020-02-20 2020-03-01 NaT released 1000000003 50s male Seoul Jongno-gu middle-aged people
In [84]:
lat_long_lookup = (
    df_region[["province", "city", "latitude", "longitude"]]
    .copy()
    .set_index(["province", "city"])
)
lat_long_lookup
Out[84]:
latitude longitude
province city
Seoul Seoul 37.566953 126.977977
Gangnam-gu 37.518421 127.047222
Gangdong-gu 37.530492 127.123837
Gangbuk-gu 37.639938 127.025508
Gangseo-gu 37.551166 126.849506
... ... ... ...
Gyeongsangnam-do Haman-gun 35.272481 128.40654
Hamyang-gun 35.520541 127.725177
Hapcheon-gun 35.566702 128.16587
Jeju-do Jeju-do 33.488936 126.500423
Korea Korea 37.566953 126.977977

244 rows × 2 columns

In [85]:
@lru_cache(maxsize=512)  # use memoization to speed up repeated lookups
def latlong_lookup(province: str, city: str, attribute: str) -> float:
    try:
        return lat_long_lookup.loc[(province, city), attribute]
    except KeyError:
        # unknown (province, city) combinations resolve to NaN
        return np.nan
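Note that `lru_cache` only works here because the function takes plain hashable strings rather than Series or rows. A minimal standalone sketch of the same pattern (the `lookup` function and its counter are illustrative, not part of the notebook):

```python
from functools import lru_cache

calls = 0  # count actual (non-cached) evaluations


@lru_cache(maxsize=512)
def lookup(key: str) -> int:
    global calls
    calls += 1
    return len(key)


lookup("Seoul")
lookup("Seoul")  # served from the cache, no second evaluation
lookup("Busan")
print(calls)  # 2
```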
In [86]:
infection_vectors = infection_depth_1[
    ["province", "city", "infector_province", "infector_city"]
].copy()


def enrich_with_coordinates(row):
    row["infected_latitude"] = latlong_lookup(row["province"], row["city"], "latitude")
    row["infected_longitude"] = latlong_lookup(
        row["province"], row["city"], "longitude"
    )
    row["infector_latitude"] = latlong_lookup(
        row["infector_province"], row["infector_city"], "latitude"
    )
    row["infector_longitude"] = latlong_lookup(
        row["infector_province"], row["infector_city"], "longitude"
    )
    return row


# OPTION 1 - row-wise apply, ~2.5 sec (kept for reference, too slow)
# infection_vectors = infection_vectors.apply(enrich_with_coordinates, axis=1)

# OPTION 2 - per-column apply with memoized lookups, ~0.057 sec
infection_vectors["infected_latitude"] = infection_vectors.apply(
    lambda row: latlong_lookup(row["province"], row["city"], "latitude"), axis=1
)
infection_vectors["infected_longitude"] = infection_vectors.apply(
    lambda row: latlong_lookup(row["province"], row["city"], "longitude"), axis=1
)
infection_vectors["infector_latitude"] = infection_vectors.apply(
    lambda row: latlong_lookup(
        row["infector_province"], row["infector_city"], "latitude"
    ),
    axis=1,
)
infection_vectors["infector_longitude"] = infection_vectors.apply(
    lambda row: latlong_lookup(
        row["infector_province"], row["infector_city"], "longitude"
    ),
    axis=1,
)

infection_vectors = infection_vectors.dropna(axis=0)
infection_vectors
Out[86]:
province city infector_province infector_city infected_latitude infected_longitude infector_latitude infector_longitude
0 Seoul Seongbuk-gu Seoul Jungnang-gu 37.589562 127.016700 37.606832 127.092656
1 Seoul Jongno-gu Seoul Jongno-gu 37.572999 126.979189 37.572999 126.979189
2 Seoul Jongno-gu Seoul Jongno-gu 37.572999 126.979189 37.572999 126.979189
3 Seoul Seongbuk-gu Seoul Jongno-gu 37.589562 127.016700 37.572999 126.979189
4 Seoul Jongno-gu Seoul Jongno-gu 37.572999 126.979189 37.572999 126.979189
... ... ... ... ... ... ... ... ...
1225 Gyeongsangnam-do Geochang-gun Gyeongsangnam-do Geochang-gun 35.686526 127.910021 35.686526 127.910021
1226 Gyeongsangnam-do Changnyeong-gun Gyeongsangnam-do Changnyeong-gun 35.544603 128.492330 35.544603 128.492330
1227 Gyeongsangnam-do Sacheon-si Gyeongsangnam-do Sacheon-si 35.003668 128.064272 35.003668 128.064272
1228 Gyeongsangnam-do Hapcheon-gun Gyeongsangnam-do Jinju-si 35.566702 128.165870 35.180313 128.108750
1229 Jeju-do Jeju-do Jeju-do Jeju-do 33.488936 126.500423 33.488936 126.500423

1167 rows × 8 columns
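The per-column `apply` above is fast thanks to memoization, but the same enrichment can also be done fully vectorized with `DataFrame.merge`. A minimal sketch on made-up rows (`lookup` and `vectors` are stand-ins for `lat_long_lookup` and `infection_vectors`):

```python
import pandas as pd

# stand-in for lat_long_lookup: one lat/long row per (province, city)
lookup = pd.DataFrame(
    {
        "province": ["Seoul", "Seoul"],
        "city": ["Jongno-gu", "Seongbuk-gu"],
        "latitude": [37.572999, 37.589562],
        "longitude": [126.979189, 127.016700],
    }
)

# stand-in for infection_vectors before enrichment
vectors = pd.DataFrame(
    {
        "province": ["Seoul"],
        "city": ["Seongbuk-gu"],
        "infector_province": ["Seoul"],
        "infector_city": ["Jongno-gu"],
    }
)

# one merge per endpoint: first the infected side, then the infector side
enriched = vectors.merge(
    lookup.rename(
        columns={"latitude": "infected_latitude", "longitude": "infected_longitude"}
    ),
    on=["province", "city"],
    how="left",
).merge(
    lookup.rename(
        columns={
            "province": "infector_province",
            "city": "infector_city",
            "latitude": "infector_latitude",
            "longitude": "infector_longitude",
        }
    ),
    on=["infector_province", "infector_city"],
    how="left",
)

print(enriched[["infected_latitude", "infector_longitude"]].iloc[0].tolist())
```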

In [87]:
animated_paths = True

map_lat_long = folium.Map(
    location=[36.0859, 127.9468], zoom_start=6.5, control_scale=True, width=800
)

for index, row in infection_vectors.iterrows():
    if animated_paths:
        plugins.AntPath(
            [
                [row["infector_latitude"], row["infector_longitude"]],
                [row["infected_latitude"], row["infected_longitude"]],
            ],
            weight=3,
        ).add_to(map_lat_long)
    else:
        folium.PolyLine(
            [
                [row["infector_latitude"], row["infector_longitude"]],
                [row["infected_latitude"], row["infected_longitude"]],
            ],
            weight=2,
        ).add_to(map_lat_long)

folium_utils.figure(map_lat_long)
Out[87]:

This graph does not really match what we expected! We remember seeing lots of cases around Gyeonggi-do, but the bulk of lines appear to be around Seoul. Let's find out why:

In [88]:
infection_vectors.value_counts().head(10)
Out[88]:
province           city           infector_province  infector_city  infected_latitude  infected_longitude  infector_latitude  infector_longitude
Gyeonggi-do        Seongnam-si    Gyeonggi-do        Seongnam-si    37.420000          127.126703          37.420000          127.126703            98
                   Bucheon-si     Gyeonggi-do        Bucheon-si     37.503393          126.766049          37.503393          126.766049            62
                   Gunpo-si       Gyeonggi-do        Gunpo-si       37.361653          126.935206          37.361653          126.935206            48
Chungcheongnam-do  Cheonan-si     Chungcheongnam-do  Cheonan-si     36.814980          127.113868          36.814980          127.113868            45
Gyeongsangbuk-do   Yecheon-gun    Gyeongsangbuk-do   Yecheon-gun    36.646707          128.437435          36.646707          128.437435            35
Incheon            Bupyeong-gu    Incheon            Bupyeong-gu    37.507031          126.721804          37.507031          126.721804            28
Gyeonggi-do        Suwon-si       Gyeonggi-do        Suwon-si       37.263376          127.028613          37.263376          127.028613            28
Gyeongsangbuk-do   Gyeongju-si    Gyeongsangbuk-do   Gyeongju-si    35.856185          129.224796          35.856185          129.224796            26
Gyeonggi-do        Pyeongtaek-si  Gyeonggi-do        Pyeongtaek-si  36.992293          127.112709          36.992293          127.112709            22
Incheon            Michuhol-gu    Incheon            Michuhol-gu    37.463572          126.650270          37.463572          126.650270            22
dtype: int64

There are two issues:

  1. Heavy overlap: the top row alone produces 98 lines rendered exactly on top of each other
  2. Infections within the same province/city are invisible, because infector and infected share identical coordinates, producing zero-length lines

Let's add a couple of improvements to help the visualization:

  1. Add minor random jitter to the coordinates so overlapping lines render independently
  2. Circle/highlight the top focus areas
In [89]:
animated_paths = True

map_lat_long = folium.Map(
    location=[36.0859, 127.9468], zoom_start=6.5, control_scale=True, width=900
)


def jitter(val: float) -> float:
    # random offset in [-0.03, +0.03) degrees so overlapping lines separate
    lo, hi = -0.03, 0.03
    return val + (random() * (hi - lo)) + lo


for index, row in infection_vectors.iterrows():
    config = {
        "locations": [
            [jitter(row["infector_latitude"]), jitter(row["infector_longitude"])],
            [jitter(row["infected_latitude"]), jitter(row["infected_longitude"])],
        ],
        "weight": 3,
        "opacity": 0.3,
    }
    if animated_paths:
        plugins.AntPath(**config).add_to(map_lat_long)
    else:
        folium.PolyLine(**config).add_to(map_lat_long)


top_hotspots = infection_vectors.value_counts().head(10)
for index, count in top_hotspots.items():
    # index is the (province, city, ..., coordinates) MultiIndex tuple
    city = index[1]
    lat = index[4]   # infected_latitude
    long = index[5]  # infected_longitude
    folium.Circle(
        radius=5000, location=[lat, long], color="red", fill_color="red", tooltip=city
    ).add_to(map_lat_long)

folium_utils.figure(map_lat_long)
Out[89]:

Conclusions¶

Let's review the list of expectations/theories we started with and assess whether they can be rejected or they warrant further investigation.

This is the original list we had:

Revisiting our initial Expectations¶

Strong social distancing policies helped reduce the number of infections ✅ The data shows a sharp drop in confirmed infections once the social distancing policies are implemented.

The relatively slow decline (over a couple of weeks) could be attributed to the virus' inherent incubation period (between 2 and 14 days).

Given the data, we cannot reject this claim and we could do further research on it.

The elderly population is most at risk of serious illness or death if they contract COVID-19 ✅

Even though the younger generations had the most confirmed cases of infection, the elderly population took the biggest hit in terms of deceased cases.

The situation reaches catastrophic proportions when we look at ratios instead of absolute numbers, with 80+ year olds reaching an almost 25% mortality rate.

These numbers might be biased because we don't know whether the groups we are comparing had similar rates of testing: while we do have country-wide tested/confirmed/deceased counts, we do not have a breakdown of tested cases per age group.

Getting this data might help future analysis understand if there is a bias in this dataset or not.
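Testing-bias caveat aside, the per-age mortality ratio itself reduces to a simple groupby. A minimal sketch on synthetic, made-up numbers (`patients` is a hypothetical stand-in for the patient-level table):

```python
import pandas as pd

# hypothetical patient-level data: age group and final state
patients = pd.DataFrame(
    {
        "age": ["20s", "20s", "80s", "80s", "80s", "80s"],
        "state": ["released", "released", "deceased",
                  "released", "released", "released"],
    }
)

# mortality rate = deceased cases / confirmed cases, per age group
mortality = (
    patients.assign(deceased=patients["state"].eq("deceased"))
    .groupby("age")["deceased"]
    .mean()
)
print(mortality)  # 20s -> 0.00, 80s -> 0.25
```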

Young adults are the largest spreaders of confirmed cases ❌

Contrary to what we expected, young people are not the primary vector for infection.

Ignoring the sizeable group of people missing "infected_by" data, the age group causing the most infections is the 40-59 year olds.

A cross-age analysis showed that most infections occur within the same age group.

Some age groups also have a secondary cross-group infection with groups 30 years apart.

We suspect this reflects Korea-specific demographics: the roughly 30-year gap matches the age difference between parents and children, suggesting infections within the same family.

The average mother's age at first birth in South Korea is around 33 years.
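The cross-age analysis referenced above boils down to a contingency table of patient age group vs. infector age group, which `pd.crosstab` produces directly. A minimal sketch on made-up (patient, infector) pairs:

```python
import pandas as pd

# made-up (patient age, infector age) pairs
df = pd.DataFrame(
    {
        "age": ["20s", "50s", "50s", "20s"],
        "infector_age": ["50s", "50s", "50s", "20s"],
    }
)

# rows: patient age group, columns: infector age group
table = pd.crosstab(df["age"], df["infector_age"])
print(table)
```

A heatmap of this table makes the within-group diagonal and any cross-group off-diagonal bands immediately visible.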

Most of the infections (hotspots) begin around Seoul and adjacent cities. ✅

As expected, most of the hotspots included Seoul and adjacent cities.

An in-depth analysis allowed us to identify cross-province and cross-city infections, intra-city clusters, as well as point-to-point infections.

Executive Summary¶

This dataset includes South Korean data (population, infection, country statistics, etc...) during the first wave of the COVID-19 pandemic.

Despite its high population density, South Korea performed outstandingly, with very low mortality rates.

This was aided by several factors:

  • Mass testing facilities and equipment to keep a pulse on the situation
  • Strong social distancing policies to limit spread
    • These proved to work in densely populated areas as well as sparsely populated ones.

Suggested Policies¶

  1. Implement and enable broad and rapid COVID testing.
    1. Ensure strong reserves of testing kits so a high testing rate can be sustained for months.
    2. The rest of the measures require regular, up-to-date data to keep monitoring our country's progress.
  2. Implement strong social distancing policies to reduce spread.
    1. Start "Strong" policies when cases are spiking (it's unclear what "Strong" meant in this dataset, but following South Korea's example proved useful).
    2. Return to "Weak" policies when cases drop, to avoid "strong policy" fatigue.
  3. Implement policies to protect population most at risk:
    1. ages 0-49: standard protection
    2. ages 50-69: strong protection policies
    3. ages 70+: very strong protection policies
  4. Regularly check cross-province travel to make sure hotspots do not infect adjacent regions.
  5. Iterate
    1. Since we have only seen the first wave of 2020, we must continuously test so we can monitor and adapt to unexpected circumstances.

Future Research¶

  • It would be ideal to have a richer dataset:
    • with more granular data (e.g. tested cases broken down by age), so we can tell whether the small skews we see reflect an inherent trend in the population or a sampling bias.
    • with more gender-specific data. It is still unclear why the large spike in confirmed cases affected over 2,000 female patients; if systemic issues increase their risk, we would like to identify them so they can be corrected. With the current dataset, we can only speculate.
  • This dataset is missing the PatientRoute CSV file, which was removed from Kaggle due to privacy concerns.
    • In future analyses we could explore similar datasets to see whether they match the trends shown on our maps.